PCA: Income Data with R

Using tidymodels

1 Introduction

This document demonstrates how to perform Principal Component Analysis (PCA) in R using the tidymodels framework. PCA is a dimensionality reduction technique that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. We will use the adult_income_dataset.csv for this demonstration.
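
To make the idea concrete, here is a small, self-contained illustration on simulated data (not the income dataset): two strongly correlated inputs are transformed into component scores that are uncorrelated.

Code
# Toy example: PCA turns correlated inputs into uncorrelated components
set.seed(42)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)          # strongly correlated with x1
toy <- data.frame(x1, x2)

cor(toy$x1, toy$x2)                      # high correlation between inputs
toy_pca <- prcomp(toy, scale. = TRUE)
round(cor(toy_pca$x), 10)                # component scores are uncorrelated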

2 Load Data

First, we load the necessary libraries and the income dataset.

Code
library(tidyverse)
library(tidymodels)
library(factoextra)

income_data <- read_csv("../data/adult_income_dataset.csv")

# For simplicity, we'll remove rows with any missing values and the 'income' column
income_data_clean <- income_data %>%
  select(-income) %>%
  na.omit() %>%
  sample_n(1000) # Randomly sample 1,000 rows

# Preprocessing using recipes
income_recipe <- recipe(~ ., data = income_data_clean) %>%
  step_rm(all_nominal_predictors()) %>% # Remove all nominal (categorical) predictors
  step_normalize(all_numeric_predictors()) %>% # Normalize all numerical predictors
  prep(training = income_data_clean)

income_data_processed <- bake(income_recipe, new_data = income_data_clean)

# Drop any constant (zero-variance) columns, which carry no information for PCA
keep_cols <- sapply(income_data_processed, function(x) is.finite(sd(x)) && sd(x) > 0)
income_data_processed <- income_data_processed[, keep_cols]
Code
glimpse(income_data_processed)
Rows: 1,000
Columns: 6
$ age              <dbl> 1.30127420, 0.49702562, -0.89213102, -1.25769855, 0.4…
$ fnlwgt           <dbl> 0.14594412, 0.02091497, -0.72304477, 0.47604431, 1.46…
$ `education-num`  <dbl> 0.00320406, 0.00320406, 1.60523414, 0.40371158, 1.204…
$ `capital-gain`   <dbl> -0.1346700, -0.1346700, -0.1346700, -0.1346700, 1.020…
$ `capital-loss`   <dbl> -0.2088594, -0.2088594, -0.2088594, -0.2088594, -0.20…
$ `hours-per-week` <dbl> -0.08372406, -0.08372406, 0.31918287, -0.08372406, 0.…
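
As a quick sanity check, each normalized column should have a mean of approximately zero and a standard deviation of approximately one after baking. A short check:

Code
# Sanity check: step_normalize should leave each column with mean ~0 and sd ~1
income_data_processed %>%
  summarise(across(everything(), list(mean = mean, sd = sd))) %>%
  glimpse()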

3 Principal Component Analysis

We will perform PCA on the preprocessed income data.

Code
# Perform PCA
res.pca <- prcomp(income_data_processed, scale. = FALSE) # variables were already centered and scaled by the recipe

# Visualize eigenvalues (scree plot)
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
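
The same information can be read off numerically from the standard deviations stored in the prcomp object: squaring them gives the variance of each component, and dividing by the total gives the proportions shown on the scree plot.

Code
# Proportion and cumulative proportion of variance explained per component
prop_var <- res.pca$sdev^2 / sum(res.pca$sdev^2)
tibble(
  component  = paste0("PC", seq_along(prop_var)),
  proportion = prop_var,
  cumulative = cumsum(prop_var)
)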

4 PCA Results Visualization

We can visualize the results of PCA, including the contribution of variables to the principal components and the individuals’ positions.

Code
# Plot of individuals on the first two principal components
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,     # Avoid text overlapping
             ggtheme = theme_minimal()
             )

Code
# Plot of variables
fviz_pca_var(res.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,     # Avoid text overlapping
             ggtheme = theme_minimal()
             )

Code
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

Code
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
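
The quantities behind these plots are also available in tabular form: factoextra's get_pca_var() returns the coordinates, cos2 values (quality of representation), and contributions of each variable.

Code
# Variable-level PCA results as matrices
var_res <- get_pca_var(res.pca)
head(var_res$contrib) # contributions (%) of each variable to each component
head(var_res$cos2)    # quality of representation of each variable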

Code
# 2D scatter plot of individuals on the first two principal components, built directly with ggplot2
# Extract the principal components
pca_coords <- as.data.frame(res.pca$x)

# Plot using ggplot2
ggplot(pca_coords, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.6) +
  labs(title = "PCA of Income Data (First Two Components)",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()
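
A biplot can also help with interpretation by overlaying the variable loadings on the individuals' scores. A sketch using factoextra (labelling only the variables, since labelling 1,000 individuals would clutter the plot):

Code
# Biplot: individuals and variable loadings on the same axes
fviz_pca_biplot(res.pca,
                label = "var",       # label variables only
                col.var = "#FC4E07",
                col.ind = "grey60",
                repel = TRUE,
                ggtheme = theme_minimal()
                )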

5 Conclusion

This document provided an overview of Principal Component Analysis in R using tidymodels. We demonstrated how to perform PCA on the income dataset and visualize the results. Note that if your data is mostly categorical, PCA is often not the best choice; methods designed for categorical or mixed data, such as Multiple Correspondence Analysis (MCA) or Factor Analysis of Mixed Data (FAMD), are usually better suited.

6 Keeping Top 5 Components

We can select and work with a reduced number of principal components, for example, the first five components, which capture most of the total variance. With only six numeric variables in the processed data this is a modest reduction, but the same workflow applies to wider datasets.

Code
# Get the principal components (coordinates of the individuals)
head(res.pca$x[, 1:5])
             PC1        PC2        PC3         PC4         PC5
[1,] -0.30820157 -0.7034257  0.4603546  0.21892956  0.20877998
[2,] -0.04308345 -0.2088990  0.2588782  0.02618587  0.05519673
[3,] -0.76839229  1.2796291 -0.4834717 -0.74272659 -1.02196672
[4,]  0.59105267  0.9846954 -0.6859196  0.11686084 -0.41102262
[5,] -0.86178201  0.4203588 -0.6879925  1.72778722 -0.15791388
[6,] -0.06130784 -0.9194863  2.0712145 -0.42712528  0.03847342
Code
# Get the loadings (eigenvectors)
head(res.pca$rotation[, 1:5])
                      PC1        PC2        PC3         PC4         PC5
age            -0.3845978 -0.6119846  0.3307280  0.13142952  0.16486731
fnlwgt          0.3534702 -0.0187077 -0.5159685  0.69617116  0.16787246
education-num  -0.5033651  0.3202052 -0.3236615  0.01110994 -0.63251341
capital-gain   -0.4616662  0.1384811  0.2893874  0.66679350  0.01273139
capital-loss   -0.1943652 -0.6533350 -0.5481655 -0.08689269 -0.16162932
hours-per-week -0.4720735  0.2767127 -0.3679786 -0.21397349  0.71991546
Code
# Visualize the variance explained by the first 5 components
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50), ncp = 5)
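
If the reduced components are meant to feed a downstream tidymodels workflow, the PCA step can also live inside the recipe itself. The sketch below uses step_pca() with num_comp = 5 and mirrors the preprocessing above; it is an alternative route, not a line-for-line reproduction of the prcomp() results (component signs can differ).

Code
# A tidymodels-native alternative: PCA as a recipe step keeping 5 components
pca_recipe <- recipe(~ ., data = income_data_clean) %>%
  step_rm(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5) %>%
  prep()

pca_scores <- bake(pca_recipe, new_data = NULL) # component scores, typically named PC1 ... PC5
head(pca_scores)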